NYC Green Taxi Analysis
Posted on Thu 25 August 2016 in posts
%pylab inline
Question 1¶
- Programmatically download and load into your favorite analytical tool the trip data for September 2015.
- Report how many rows and columns of data you have loaded.
import pandas as pd
pd.options.mode.chained_assignment = None
from __future__ import division
green = pd.read_csv('https://storage.googleapis.com/tlc-trip-data/2015/green_tripdata_2015-09.csv')
green.head()
green.shape
There are 1,494,926 rows and 21 columns in the NYC Green Taxi trip data for September 2015.
Exploratory Analysis and Data Cleaning¶
Looking through the data, some of the fields seem erroneous to me. For example, Passenger_count, Trip_distance, Fare_amount, Extra, MTA_tax, Tip_amount, Tolls_amount, improvement_surcharge and Total_amount all have values less than or equal to 0. There was also one trip with a total distance of 600 miles, which seems suspicious to me! Since about 1.5 million rows are loaded, I will simply remove these bad records.
list(green.columns)
green.describe()
# The mysterious 600 miles trip...
green[green.Trip_distance == max(green.Trip_distance)].loc[:,'Trip_distance':]
The above row looks like bad data, so I will simply remove it.
clean_green = green[(green.Passenger_count>0)
&(green.Trip_distance>0) & (green.Trip_distance!=max(green.Trip_distance))
&(green.Fare_amount>0)
&(green.Extra >=0)
&(green.MTA_tax>0)
&(green.Tip_amount>=0)
&(green.Tolls_amount>=0)
&(green.improvement_surcharge>=0)
&(green.Total_amount>0)]
clean_green.shape
clean_green.describe()
Question 2¶
- Plot a histogram of the trip distance ("Trip Distance").
- Report any structure you find and any hypotheses you have about that structure.
import seaborn as sns
plt.figure(figsize=(20,10))
plt.title('Trip Distance Distribution')
sns.distplot(clean_green.Trip_distance, kde=False)
Looking at the histogram of trip distance, most trips were within 20 miles, with only a few long-distance trips. By calculation, 96% of the trips were less than 10 miles and 99% were less than 20 miles. The median from the summary above is 2 miles.
print clean_green[clean_green.Trip_distance<10].shape[0]/clean_green.shape[0]
print clean_green[clean_green.Trip_distance<20].shape[0]/clean_green.shape[0]
My hypothesis about this distribution is that it is not cost-effective to take a taxi for a long-distance trip. The average total amount is $14.8 for trips under 20 miles but $86 for trips of 20 miles or more. People might choose other modes of transportation for long-distance travel.
print mean(clean_green[clean_green.Trip_distance<20].Total_amount)
print mean(clean_green[clean_green.Trip_distance>=20].Total_amount)
Question 3¶
- Report mean and median trip distance grouped by hour of day.
- We'd like to get a rough sense of identifying trips that originate or terminate at one of the NYC area airports. Can you provide a count of how many transactions fit this criteria, the average fare, and any other interesting characteristics of these trips.
clean_green['pickup_hour'] = pd.to_datetime(clean_green.lpep_pickup_datetime).apply(lambda x: x.hour)
# Mean and Median Trip Distance by hour of day
clean_green.groupby('pickup_hour').mean()[['Trip_distance']].transpose()
clean_green.groupby('pickup_hour').median()[['Trip_distance']].transpose()
To estimate whether a pickup or drop-off location is at an airport (JFK, LGA or EWR), I used the geopy package (http://geopy.readthedocs.io/en/1.10.0/) to calculate the distance between two coordinates. I took a naive approach: if the distance between the target coordinates and an airport is within 0.5 mile, I consider that an airport location.
from geopy.distance import vincenty
def nyc_airports(location, threshold=0.5):
    """Helper function to determine whether a coordinate pair is within the proximity of an airport."""
    JFK = (40.639722, -73.778889)
    LGA = (40.77725, -73.872611)
    EWR = (40.6925, -74.168611)
    if vincenty(JFK, location).miles <= threshold:
        return 'JFK'
    elif vincenty(LGA, location).miles <= threshold:
        return 'LGA'
    elif vincenty(EWR, location).miles <= threshold:
        return 'EWR'
    else:
        return 'Other'
clean_green['pickup_coordinates'] = zip(clean_green.Pickup_latitude,clean_green.Pickup_longitude)
clean_green['dropoff_coordinates'] = zip(clean_green.Dropoff_latitude,clean_green.Dropoff_longitude)
clean_green['pickup_location'] = clean_green.pickup_coordinates.apply(nyc_airports)
clean_green['dropoff_location'] = clean_green.dropoff_coordinates.apply(nyc_airports)
clean_green.loc[:,'pickup_coordinates':].head()
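Applying `vincenty` row by row (up to three calls per row on roughly 1.5 million rows) can be slow. As a sketch of an alternative, a vectorized haversine distance computed with NumPy could classify every coordinate in a single pass, trading a little accuracy for speed. The `haversine_miles` helper below is my own illustration, not part of geopy:

```python
import numpy as np

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars or NumPy arrays."""
    R = 3958.8  # mean Earth radius in miles
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

# Example: distance from JFK to LGA in one vectorized call
JFK = (40.639722, -73.778889)
LGA = (40.77725, -73.872611)
lats = np.array([JFK[0]])
lons = np.array([JFK[1]])
d = haversine_miles(lats, lons, LGA[0], LGA[1])
print(d[0])  # roughly 10.7 miles
```

With this helper, each airport check becomes a single array comparison over the whole latitude/longitude columns instead of a per-row `apply`.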
Now let's look at how we are doing in finding trips that originate or terminate at one of the NYC area airports!¶
clean_green.groupby('pickup_location').size()
clean_green.groupby('dropoff_location').size()
I am a bit surprised that there were few trips originating from the airports but many terminating at the airports! However, after googling NYC green taxis (http://www.nyc.gov/html/tlc/html/passenger/shl_passenger.shtml), I found that "Boro Taxi drivers can pick up passengers from the street in northern Manhattan (north of West 110th street and East 96th street), the Bronx, Queens (excluding the airports), Brooklyn and Staten Island and they may drop you off anywhere." So it turns out these green taxis are not supposed to pick up at airports!
# Figuring out airport trips and non-airport trips
masks = (clean_green.pickup_location != 'Other') | (clean_green.dropoff_location != 'Other')
airport_trips = clean_green[masks]
non_airport_trips = clean_green[~masks]
print airport_trips.shape[0]
print mean(airport_trips.Total_amount)
About 18,300 transactions fit this criterion, and the average fare of these trips is $34.2. I'm also interested in when these airport trips happened during the day compared with other trips, and whether there is any difference in the tip percentage.
plt.figure(figsize=(8,6))
plt.title('Airport Trips Pickup Hour Distribution')
sns.distplot(airport_trips.pickup_hour,kde=False)
plt.figure(figsize=(8,6))
plt.title('Non-Airport Trips Pickup Hour Distribution')
sns.distplot(non_airport_trips.pickup_hour,kde=False)
It's interesting to notice the difference in the hourly distributions. For airport trips, most trips happened in the morning and afternoon; very few trips were observed after 8:00 pm, which probably has to do with fewer flights later in the evening. For non-airport trips, we can see a spike around 6:00 pm that declines throughout the evening. It could be people going home after work or after a night out.
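Since the two groups sit on very different count scales (roughly 18 thousand airport trips versus well over a million non-airport trips), normalizing each hourly distribution to proportions makes the two shapes directly comparable. A minimal sketch with `value_counts` (shown on toy hour values; in the notebook, `airport_trips.pickup_hour` and `non_airport_trips.pickup_hour` would be the real inputs):

```python
import pandas as pd

# Toy stand-ins for airport_trips.pickup_hour / non_airport_trips.pickup_hour
airport_hours = pd.Series([8, 9, 9, 14, 15, 15, 16])
non_airport_hours = pd.Series([18, 18, 18, 19, 20, 21, 22, 23, 7, 18])

# Proportion of trips in each hour, so both groups are on the same scale
airport_prop = airport_hours.value_counts(normalize=True).sort_index()
non_airport_prop = non_airport_hours.value_counts(normalize=True).sort_index()
print(pd.DataFrame({'airport': airport_prop, 'non_airport': non_airport_prop}).fillna(0))
```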
mean(non_airport_trips.Tip_amount/(non_airport_trips.Total_amount-non_airport_trips.Tip_amount))
mean(airport_trips.Tip_amount/(airport_trips.Total_amount-airport_trips.Tip_amount))
It's also interesting to see the difference in tip percentage between airport and non-airport trips: 12.7% for airport trips and 8.4% for non-airport trips. It could be that passengers who got help with their luggage are more likely to tip their drivers well.
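The 12.7% versus 8.4% gap looks large, but a formal test would confirm it is not noise. A sketch using scipy's Mann-Whitney U test, which does not assume normality (run here on synthetic stand-ins; in the notebook, the two tip-percentage series computed above would be passed in):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
# Synthetic stand-ins for the airport / non-airport tip-percentage series
airport_tip_perc = rng.normal(0.127, 0.05, 500)
non_airport_tip_perc = rng.normal(0.084, 0.05, 5000)

# Mann-Whitney U is robust to the skew typical of tip distributions
u_stat, p_value = stats.mannwhitneyu(airport_tip_perc, non_airport_tip_perc,
                                     alternative='two-sided')
print(p_value)
```

A small p-value would support the claim that the two groups really do tip differently.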
Question 4¶
- Build a derived variable for tip as a percentage of the total fare.
- Build a predictive model for tip as a percentage of the total fare. Use as much of the data as you like (or all of it). We will validate a sample.
To better capture tip as a percentage of the total fare, I will only select the credit card transactions, since according to the data dictionary: "Tip amount – This field is automatically populated for credit card tips. Cash tips are not included."
tip_data = clean_green[clean_green.Payment_type == 1 ]
#Build the target variable tip as a percentage of the total fare
tip_data['tip_perc'] = tip_data.Tip_amount/(tip_data.Total_amount-tip_data.Tip_amount)
plt.figure(figsize=(20,10))
plt.title('Tip Percentage Distribution')
plt.xlim(0, 1)
sns.distplot(tip_data.tip_perc,kde=False)
It's interesting to see spikes around 20%, 25% and 30%, since these are the default tip percentages offered during checkout.
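These spikes could be quantified directly by counting trips whose tip percentage falls within a narrow window of each default option. A small sketch (toy values here; `tip_data.tip_perc` would be the real input):

```python
import pandas as pd

# Toy stand-in for tip_data.tip_perc
tip_perc = pd.Series([0.20, 0.201, 0.25, 0.199, 0.30, 0.15, 0.22, 0.25])

# Share of trips within +/-0.5% of each common default tip option
shares = {d: ((tip_perc - d).abs() < 0.005).mean() for d in (0.20, 0.25, 0.30)}
for default in sorted(shares):
    print('%.0f%%: %.3f' % (default * 100, shares[default]))
```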
Feature Engineering and Selection¶
# Creating a variable trip_time in minutes
tip_data['trip_time'] = (pd.to_datetime(clean_green.Lpep_dropoff_datetime)-
pd.to_datetime(clean_green.lpep_pickup_datetime)).astype('timedelta64[s]')/60
# Creating a variable total_fare for the amount paid excluding tips
tip_data['total_fare'] = tip_data.Total_amount-tip_data.Tip_amount
# Creating dummy variables for hour of the day
dummies = pd.get_dummies(tip_data['pickup_hour'],prefix='hour')
tip_data = pd.concat([tip_data, dummies], axis=1)
# Creating a variable for airport trips
tip_data['airport'] = ((tip_data.pickup_location != 'Other') | (tip_data.dropoff_location != 'Other')).astype(int)
#feature columns
X_columns = ['Passenger_count','Trip_distance','trip_time','total_fare','airport',
'hour_0','hour_1','hour_2','hour_3','hour_4','hour_5','hour_6','hour_7','hour_8',
'hour_9','hour_10','hour_11','hour_12','hour_13','hour_14','hour_15','hour_16',
'hour_17','hour_18','hour_19','hour_20','hour_21','hour_22','hour_23']
y_column = ['tip_perc']
Running a linear regression to predict the tip percentage¶
from sklearn import feature_selection as f_select
from sklearn import linear_model as lm
from sklearn import metrics
from sklearn import cross_validation as cv
# Selecting significant columns by testing each single regressor
significant_columns = []
pvals = []
for feature in X_columns:
    pval = f_select.f_regression(tip_data[[feature]], tip_data[y_column].values.ravel())
    if pval[1][0] < 0.05:
        significant_columns.append(feature)
        pvals.append(pval[1][0])
# Train Test Split the data
x_train, x_test, y_train, y_test = cv.train_test_split(tip_data[significant_columns],
tip_data[y_column],
test_size=0.333,
random_state=1234)
model = lm.LinearRegression().fit(x_train, y_train)
Model Evaluation and Explanation¶
print pd.DataFrame({'column': significant_columns,
'coef': model.coef_.tolist()[0],
'p-values':pvals}).set_index('column')
print
print 'Metrics on Training Set'
print 'R^2 :', metrics.r2_score(y_train, model.predict(x_train))
print 'MSE :', metrics.mean_squared_error(y_train, model.predict(x_train))
print
print 'Metrics on Testing Set'
print 'R^2 :', metrics.r2_score(y_test, model.predict(x_test))
print 'MSE :', metrics.mean_squared_error(y_test, model.predict(x_test))
From the coefficients we can find interesting things about these features:
- Trip distance is positively correlated with the tip percentage, while both trip time and total fare have a negative impact.
- Trips to airports seem to be correlated with a higher tip percentage.
- Late-night trips seem to indicate a higher tip percentage.
Overall the linear regression model does not perform very well, with a mere 0.005 R^2 on training and 0.002 on testing, which means that with these features we can only explain a tiny portion of the variance in the data. Typically, when attempting to predict human behavior, it is relatively hard to build a good model.
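One way to put a low R^2 in context is to compare the model's MSE against a constant-mean baseline, which has an R^2 of 0 by construction. A sketch using scikit-learn's `DummyRegressor` on synthetic low-signal data (in the notebook, `x_train`/`y_train` would replace the synthetic arrays):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(0)
# Synthetic stand-in: one weak feature plus noise, mimicking a low-R^2 setting
X = rng.rand(1000, 1)
y = 0.01 * X[:, 0] + rng.normal(0, 0.1, 1000)

baseline = DummyRegressor(strategy='mean').fit(X, y)
model = LinearRegression().fit(X, y)

baseline_mse = mean_squared_error(y, baseline.predict(X))
model_mse = mean_squared_error(y, model.predict(X))
print(baseline_mse)
print(model_mse)  # should only barely beat the baseline
```

If the model's MSE is only marginally below the baseline's, the features carry little predictive signal, which matches what the R^2 values suggest here.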
Build a Classifier for high and low percentage tips¶
Instead of doing a regression on the actual percentage of the tip, I would like to build a classifier to see if we can predict whether there will be a large tip percentage, using 20% as the threshold.
tip_data['high_tip'] = (tip_data.tip_perc > 0.2).astype(int)
About 25% of the trips have a tip percentage higher than 20%.
sum(tip_data.high_tip)/len(tip_data.high_tip)
# feature columns
X_columns = ['Passenger_count','Trip_distance','trip_time','total_fare','airport',
'hour_0','hour_1','hour_2','hour_3','hour_4','hour_5','hour_6','hour_7','hour_8',
'hour_9','hour_10','hour_11','hour_12','hour_13','hour_14','hour_15','hour_16',
'hour_17','hour_18','hour_19','hour_20','hour_21','hour_22','hour_23']
y_column = ['high_tip']
# Train Test Split the data
x_train, x_test, y_train, y_test = cv.train_test_split(tip_data[X_columns],
tip_data[y_column],
test_size=0.333,
random_state=1234)
# Build a random forest clf
from sklearn.ensemble import RandomForestClassifier
treeclf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
treeclf = treeclf.fit(x_train, y_train.values.ravel())
# Evaluate the model
print 'fpr', metrics.roc_curve(y_test, treeclf.predict(x_test))[0][1] #fpr
print 'tpr', metrics.roc_curve(y_test, treeclf.predict(x_test))[1][1] #tpr
print 'precision', metrics.precision_score(y_test, treeclf.predict(x_test))
print 'accuracy', metrics.accuracy_score(y_test, treeclf.predict(x_test))
# ROC
roc = metrics.roc_curve(y_test, treeclf.predict(x_test))
plt.figure()
plt.plot([0, 0.5, 1], [0, 0.5, 1])
plt.plot(roc[0], roc[1])
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
The Random Forest classifier is still not doing a very good job identifying the positives, with only 20% recall and 38% precision. For next steps I would spend more time on feature engineering to extract more features from this dataset, and continue to try different optimizations and other classifiers such as SVM.
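Before switching models entirely, one cheap option is to sweep the decision threshold on `predict_proba` instead of using the default 0.5 cutoff of `predict`, trading precision for recall on the minority class. A sketch on synthetic imbalanced data (in the notebook, `treeclf`, `x_test` and `y_test` would be used in place of the synthetic pieces):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in: roughly 25% positives, like the high-tip class
X, y = make_classification(n_samples=2000, weights=[0.75, 0.25], random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(x_tr, y_tr)

# Sweep the probability threshold: lower cutoffs recover more positives
proba = clf.predict_proba(x_te)[:, 1]
results = {}
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_te, pred), recall_score(y_te, pred))
    print(threshold, results[threshold])
```

Lowering the threshold should raise recall at some cost in precision; the right trade-off depends on whether missing a high tip or flagging a false one is worse.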
Question 5¶
In this section, I was curious whether the airport trips I identified in the previous questions were accurate. So I plotted the pickup and dropoff locations on a map of the NYC area. The results look pretty good to me.
def inline_map(map):
    """
    Embeds the HTML source of the map directly into the IPython notebook.
    This method will not work if the map depends on any files (json data). Also this uses
    the HTML5 srcdoc attribute, which may not be supported in all browsers.
    """
    map._build_map()
    return HTML('<iframe srcdoc="{srcdoc}" style="width: 100%; height: 500px; border: none"></iframe>'
                .format(srcdoc=map.HTML.replace('"', '&quot;')))
from IPython.display import HTML
import folium
NYC = (40.7142700, -74.0059700)
map = folium.Map(location=NYC,zoom_start=12)
for each in airport_trips.iterrows():
    map.simple_marker(
        location=[each[1]['Pickup_latitude'], each[1]['Pickup_longitude']],
        clustered_marker=True)
inline_map(map)
map = folium.Map(location=NYC)
for each in airport_trips.iterrows():
    map.simple_marker(
        location=[each[1]['Dropoff_latitude'], each[1]['Dropoff_longitude']],
        clustered_marker=True)
inline_map(map)